Abstract

Understanding how an animal can deform and articulate is essential for a realistic modification of its 3D model. In this paper,
we show that such information can be learned from user-clicked 2D images and a template 3D model of the target animal.
We present a volumetric deformation framework that produces a set of new 3D models by deforming a template 3D model 
according to a set of user-clicked images. Our framework is based on a novel locally-bounded deformation energy, where every 
local region has its own stiffness value that bounds how much distortion is allowed at that location. We jointly learn the local 
stiffness bounds as we deform the template 3D mesh to match each user-clicked image. We show that this seemingly complex 
task can be solved as a sequence of convex optimization problems. We demonstrate the effectiveness of our approach on cats 
and horses, which are highly deformable and articulated animals. Our framework produces new 3D models of animals that are 
significantly more plausible than methods without learned stiffness. 

Categories and Subject Descriptors (according to ACM CCS): I.3.5 [Computer Graphics]: Computational Geometry and Object
Modeling—Geometric algorithms, languages, and systems 


1. Introduction 

Recent advances in computer vision and graphics have enabled the
collection of high-quality 3D models with tools such as multi-view
stereo [FP10] and commercial depth-sensors [IKH*11]. However,
it is still difficult to obtain models of highly articulated and
deformable objects like animals. Today, searching Turbosquid for
“chair” returns 24,929 results, while “cat” returns only 164 results.
On the other hand, the Internet is filled with images of cats. The
goal of our work is to create new 3D models of animals by
modifying a template 3D model according to a set of user-clicked 2D
images. The user clicks serve as positional constraints that guide
the shape modification.

In order to modify the shape realistically, we argue that it is
critical to understand how an animal can deform and articulate. For
example, looking at many images of cats shows that a cat’s body
may curl up like a ball or twist and that its limbs articulate, but its
skull stays mostly rigid. Hence, when modifying a cat 3D model,
we should restrict the amount of deformation allowed around the
skull, but allow larger freedom around limb joints and the torso.

In this work, we propose a novel deformation framework that
aims to capture an animal-specific 3D deformation model from a
set of annotated 2D images and a template 3D model. Our
framework is inspired by the idea of local stiffness, which specifies the
amount of distortion allowed for a local region. Stiffness is used in
3D surface deformation methods to model natural bending at joints
and elastic deformations [PJS06, BPGK06]. In previous methods,
the stiffness is provided by users or learned from a set of
vertex-aligned 3D meshes in various poses [PJS06]. Instead, we learn
stiffness from user-clicked 2D images by imposing sparsity; the idea is
that large distortion is only allowed for those regions that require
high deformation across many images. To our knowledge, our work
is the first to learn stiffness of a 3D model from annotated 2D
images.

Figure 1 shows an overview of our proposed framework. Given
a stock 3D cat mesh and target images of cats, a user provides
3D-to-2D point correspondences by clicking key features in images.
These are passed on to the proposed algorithm, which
simultaneously deforms the mesh to fit each cat’s pose and learns a
cat-specific model of 3D deformation. In the end, we obtain new 3D
models for each target image and a stiffness model that describes
how cats may deform and articulate.

Contributions: Our primary contribution is a deformation
framework that learns an animal-specific model of local stiffness as it
deforms the template model to match the user-clicked 2D-to-3D
correspondences. Specifically,

1. We propose a locally bounded volumetric deformation energy
that controls the maximal amount of distortion applied to local
regions of the model using the recent optimization techniques
of [KABL14]. The bounds act as a local stiffness model of the
animal, which we learn by imposing an L1 sparsity penalty. The
final deformation is orientation preserving and has worst-case
distortion guarantees.

2. We show that both the deformation and the stiffness bounds can
be solved jointly as a sequence of convex optimization problems.

3. We demonstrate the effectiveness of our framework on cats and
horses, which are challenging animals as they exhibit large
degrees of deformation and articulation.

Figure 1: Overview. Our inputs are a template 3D model of an animal and a set of images with user-clicked 3D-to-2D point correspondences.
The algorithm then alternates between solving for the camera viewpoint and the 3D deformations for all images. Our novel formulation
allows us to solve for the deformation for each image and the stiffness model of the animal jointly in a single semidefinite program (SDP).
The outputs of our algorithm are a set of deformed 3D models and the stiffness model, which specifies the rigidity of every local region of the
animal (red indicates high deformability and blue indicates rigidity).

2. Related Work 

Works that modify a template 3D model to fit images can be
roughly divided into two categories: those that are class-agnostic,
and those that have a class-specific deformation prior learned from
data or provided by users. Methods that do not use any
class-specific prior make use of strong image cues such as
silhouettes [VdA13,hTgL10] or contour drawings [KSvdP09]. These
approaches focus on fitting a single 3D model into a single image,
while we focus on learning a class-specific prior as we modify the
template 3D model to fit multiple images. Recently, Kholgade et al.
introduced an exciting new photo editing tool that allows users to
perform 3D manipulation by aligning 3D stock models to 2D
images [KSES14]. Our approach complements this application, which
is only demonstrated for rigid objects.

More closely related to our approach are works that make use
of prior knowledge on how the 3D model can change its shape.
Many works assume a prior is provided by users or artists in the
form of kinematic skeletons [GWBB09,AB15,TSR*14a,BTG*12,
YY14,TSR*14b] or painted stiffness [PJS06]. Since obtaining such
priors from users is expensive, many methods learn deformation
models automatically from data [BV99,AKP*04,ASK*05,CKC10,
dATTS08,LD14,SY07]. Anguelov et al. [AKP*04] use a set of
registered 3D range scans of human bodies in a variety of
configurations to construct skeletons using graphical models. Blanz and
Vetter [BV99] learn a morphable model of human faces from 3D
scans, where a 3D face is described by a linear combination of
basis faces. Given a rough initial alignment, they fit the learned
morphable models to images by restricting the model to the space
spanned by the learned basis. Similarly, [ASK*05,HSS*09] learn a
statistical model of human bodies from a set of 3D scans. Popa et
al. [PJS06] learn the material stiffness of animal meshes by
analyzing a set of vertex-aligned 3D meshes in various poses.

One of the biggest drawbacks in learning from 3D data is that
it requires a large set of registered 3D models or scans, which is
considerably more challenging to obtain compared to a set of
user-clicked photographs. All of these methods rely on 3D data with
the exception of Cashman et al. [CF13]. They learn a morphable
model of non-rigid objects such as dolphins from annotated 2D
images and a template 3D model. Our work is complementary to their
approach in that they focus on intra-class shape variation such as
fat vs thin dolphins, while we focus on deformations and
articulations due to pose changes. The use of a morphable model also
makes their approach not suitable for objects undergoing large
articulations.

Using 2D images requires camera parameters for projecting the
deformed 3D models to image coordinates. Cashman et al. [CF13]
assume a rough camera initialization is provided by a user, but we
estimate the camera parameters directly from user-clicked
2D-to-3D correspondences. There are many works regarding the
estimation of camera parameters from image correspondences, and their
discussion is outside the scope of this paper. We refer the reader
to [HZ04] for more details.

There is a rich variety of mesh deformation techniques in the
literature [BS08,ZHS*05,BPGK06,SSP07,SA07]. The main idea
is to minimize some form of deformation objective that governs
the way the mesh is modified according to user-supplied
positional constraints. Common objectives are minimization of the
elastic energy [TPBF87] or preservation of local differential
properties [LSCO*04]. The solution can be constrained to lie in the space
of natural deformations, which are learned from exemplar meshes
[SP04,SZGP05,DSP06,MTGG11,PJS06]. Our approach is related
to these methods, except that we learn the space of deformations
from a set of annotated 2D images. [BS08] offers an excellent
survey on linear surface deformation methods. While simple and
efficient to use, surface deformation methods suffer from
unnatural volumetric changes for large deformations [ZHS*05,BPGK06].
Our work is based on a volumetric representation, which we discuss
in detail in the next section.


3. Problem statement and background 

We consider the problem of modifying a template 3D mesh of an
animal according to a set of user-clicked photographs of the target
animal. Our goal is to produce plausible 3D models guided by the
annotated images, not necessarily obtaining precise 3D
reconstructions of the images. In particular, given a sparse set of 2D-to-3D
correspondences obtained from user-clicks, we wish to solve for a
set of class-specific 3D deformations that faithfully fit the image
annotations.


More formally, we are given a 3D template model, represented
by a surface mesh $S \subset \mathbb{R}^3$, as well as N images of class instances
$I_1, \ldots, I_N$. Each image is associated with a sparse set of user
prescribed point correspondences to the 3D model; namely, the i’th
image $I_i$ comes with pairs $\{(x_k^i, p_k^i)\}$ relating the surface point
$x_k^i \in S$ to a 2D image location $p_k^i \in \mathbb{R}^2$. Our goal is to leverage
the N annotated images to learn a deformation model $\mathcal{D}$ capturing
the possible deformations and articulations of the object class. In
particular, for each image $I_i$ we wish to find a deformation
$\Phi^i \in \mathcal{D}$ that maps its 3D landmark points $\{x_k^i\}$ to their
corresponding image points $\{p_k^i\}$ once projected to the image plane;
namely, satisfying

$$\begin{pmatrix} p_k^i \\ 1 \end{pmatrix} = \Pi^i \begin{pmatrix} \Phi^i(x_k^i) \\ 1 \end{pmatrix}, \qquad (1)$$

where $\Pi^i \in \mathbb{R}^{3 \times 4}$ is the camera projection matrix for the i’th image.
In what follows we assume weak perspective projection, which is
an orthographic projection followed by scaling of the x and y
coordinates:

$$\Pi = \begin{pmatrix} \alpha_x & & \\ & \alpha_y & \\ & & 1 \end{pmatrix} \begin{pmatrix} r_1 & t_1 \\ r_2 & t_2 \\ 0 & 1 \end{pmatrix}, \qquad (2)$$

where $r_1$ and $r_2$ are the first two rows of the object rotation matrix,
$t_{1,2}$ are the first two coordinates of the object translation, and
$\alpha_x / \alpha_y$ specifies the camera aspect ratio. Its parameters can be
solved in a least squares approach given six or more 3D-to-2D point
correspondences. Please see [HZ04] for more information. Note that
perspective projection may be similarly handled.
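
To make the projection model concrete, the following NumPy sketch builds a weak perspective matrix and applies it to 3D points. It is only an illustration under stated assumptions: the function names, the scale parameters `alpha_x` and `alpha_y`, and the homogeneous 3x4 layout are choices made here, not the authors' implementation.

```python
import numpy as np

def weak_perspective_matrix(R, t, alpha_x=1.0, alpha_y=1.0):
    """Build a 3x4 weak perspective projection matrix (assumed layout).

    R: 3x3 rotation matrix; only its first two rows are used.
    t: translation vector; only its first two coordinates are used.
    alpha_x, alpha_y: hypothetical names for the x and y scale factors.
    """
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    ortho = np.zeros((3, 4))
    ortho[0, :3], ortho[0, 3] = R[0], t[0]   # first row of R and t_1
    ortho[1, :3], ortho[1, 3] = R[1], t[1]   # second row of R and t_2
    ortho[2, 3] = 1.0                        # homogeneous row (0 0 0 1)
    scale = np.diag([alpha_x, alpha_y, 1.0]) # scaling applied after projection
    return scale @ ortho

def project(Pi, X):
    """Project 3D points X (k x 3) to 2D image points (k x 2)."""
    X = np.asarray(X, dtype=float)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates
    ph = Xh @ Pi.T
    return ph[:, :2]  # the third homogeneous coordinate is always 1
```

Because the third row of the rotation is dropped, two points that differ only in depth along the viewing direction project to the same image location.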


3.1. Parameterized deformation model 

We parameterize the deformations of the surface model S by
introducing an auxiliary tetrahedral mesh enclosed within the surface,
$M = (V, T)$, where $V \in \mathbb{R}^{3 \times n}$ is a matrix of n coarse vertex
coordinates and $T = \{t_j\}_{j=1}^{m}$ is a set of m tetrahedra (tets). Every
surface point $x \in S$ can then be written as a linear combination of
the vertices V. In particular, for the landmark points we set

$$x_k^i = V \alpha_k^i,$$

where $\alpha_k^i \in \mathbb{R}^n$ is a coefficient vector computed by linear moving
least squares [Lev98]. Figure 2 shows the surface and the
tetrahedral mesh of a template cat model. The use of a tetrahedral mesh
introduces a notion of volume to the model making it more robust
at preserving volumetric detail [dAST*08,ZHS*05].

Deformations of M thereby induce deformations of the
surface S. Specifically, we shall consider continuous piece-wise linear
(CPL) maps $\Phi : M \to \mathbb{R}^3$, whereby the deformation, restricted to
the j’th tet, is defined by the affine map $v \mapsto A_j v + t_j$. $\Phi$ maps the
vertices V to new locations $U \in \mathbb{R}^{3 \times n}$. In fact, $\Phi$ is uniquely
determined by the new vertex locations U; for the j’th tet, the following
full rank linear system holds

$$\begin{pmatrix} u_{j_1} & u_{j_2} & u_{j_3} & u_{j_4} \end{pmatrix} = \begin{bmatrix} A_j & t_j \end{bmatrix} \begin{pmatrix} v_{j_1} & v_{j_2} & v_{j_3} & v_{j_4} \\ 1 & 1 & 1 & 1 \end{pmatrix}, \qquad (3)$$

where $v_{j_\cdot}$ and $u_{j_\cdot}$ are its four vertices in the original and the
deformed mesh respectively.

We denote by $A_j = A_j(U)$ the linear part of each affine
transformation, linearly expressed in terms of the new vertex
locations U. Lastly, note that subject to a deformation $\Phi = \Phi_U$ the
location of the landmark points can be simply expressed as
$\hat{x}_k^i = U \alpha_k^i$. This relationship along with the positional
constraints is depicted in Figure 3.

Figure 2: A template 3D surface and its auxiliary tetrahedral mesh
with surface vertices shown in blue dots. Panels: the template cat
surface; auxiliary tetrahedral mesh.
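
The linear system in Equation (3) can be solved per tet to recover the affine map from the original and deformed vertex positions. The sketch below does this with NumPy; the function name `tet_affine` and the 3x4 column layout of the inputs are assumptions made for illustration.

```python
import numpy as np

def tet_affine(v, u):
    """Recover (A, t) such that u_k = A v_k + t for the four vertices of a tet.

    v, u: 3x4 arrays whose columns are the original and deformed vertices.
    """
    v = np.asarray(v, dtype=float)
    u = np.asarray(u, dtype=float)
    Vh = np.vstack([v, np.ones((1, 4))])   # 4x4, full rank for a non-degenerate tet
    # Solve [A t] Vh = u by transposing: Vh^T [A t]^T = u^T.
    At = np.linalg.solve(Vh.T, u.T).T      # 3x4 matrix [A t]
    return At[:, :3], At[:, 3]
```

Since the system is linear in the deformed vertices, the entries of each recovered matrix are linear functions of U, which is the property the later convex formulation relies on.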

3.2. Landmark-guided 3D deformation 

Our goal is to deform the template S such that (1) is satisfied
without introducing local distortions to its shape. A popular approach
to prevent distortion is minimizing the as-rigid-as-possible (ARAP)
functional [ACOL00,SA07]:

$$f_{\mathrm{ARAP}}(U) = \sum_{j=1}^{m} \| A_j - R_j \|_F^2 \, |t_j|, \qquad (4)$$

where $R_j \in SO(3)$ is the closest rotation to $A_j$ and $|t_j|$ is the
normalized volume of the j’th tet. Intuitively, ARAP tries to keep the local
transformations applied to each tet of the mesh as similar as
possible to a rigid transformation. Note that while the ARAP functional
is non-convex, it is convex-quadratic for fixed rotations $R_j$.
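
As a small illustration of Equation (4), the sketch below evaluates the ARAP energy for a list of per-tet linear maps. The closest rotation is obtained from the SVD with a sign correction, which is a standard construction; the function names and the list-based interface are assumptions, not the authors' code.

```python
import numpy as np

def closest_rotation(A):
    """Closest rotation to A in the Frobenius norm, computed via the SVD."""
    W, _, Vt = np.linalg.svd(A)
    # Flip the last axis if needed so that the result has determinant +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(W @ Vt))])
    return W @ D @ Vt

def arap_energy(As, volumes):
    """Sum over tets of volume * ||A_j - R_j||_F^2 with R_j the closest rotation."""
    total = 0.0
    for A, vol in zip(As, volumes):
        R = closest_rotation(np.asarray(A, dtype=float))
        total += vol * np.linalg.norm(A - R, "fro") ** 2
    return total
```

A pure rotation contributes zero energy, while a stretch such as diag(2, 1, 1) contributes its squared distance to the identity scaled by the tet volume.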



Figure 3: Illustration of the deformation model.

The ARAP functional minimizes the $\ell_2$-norm of a “non-rigidity”
measure, which strives to evenly distribute local deviations from
rigid transformation. As such, it fails to faithfully represent
articulation and local deformations. Moreover, it is not straightforward to
adapt this functional alone to benefit from having many annotated
image exemplars. In this work, we also use the ARAP functional,
but allow non-uniform distribution of distortion by assigning local
stiffness as described in the next section.

4. Learning stiffness for articulation and deformation 

Natural objects do not usually deform in a uniform manner; some
parts such as joints deform a lot more while parts such as the limbs
and skull stay rigid. In order to model such deformation and
articulation, we introduce the notion of local stiffness, which specifies
how much distortion is allowed at each tet. We learn local stiffness
from data using a sparsity promoting energy, so large deformations
are concentrated in regions that require them across many images.

We depart from the traditional skeleton models, which are a set
of rigid sticks connected by deformable joints [ASK*05, YP08].
While skeletons excel at modeling articulation, they only possess
two levels of stiffness, rigid or not. In contrast, our model can
represent multiple levels of stiffness, which is essential for representing
local deformations. Moreover, using local stiffness does not require
prior knowledge, such as the number of sticks and joints. In this
section we discuss how we simultaneously deform the template S
to match each of the images $I_1, \ldots, I_N$ while learning the stiffness.

4.1. Modeling local stiffness 

Denote by $\Phi^i$ the deformation mapping S to the i’th image $I_i$ and
by $\{A_j^i\}$ the linear transformations associated with its tets. Inspired
by [Lip12,KABL14], we control deformations by explicitly
imposing constraints on their linear parts.

First we require that each $A_j^i$ satisfies

$$\det(A_j^i) \geq 0, \qquad (5)$$

which entails that the mapping is locally injective and orientation
preserving; in particular, tets may not flip. Second, we bound the
local isometric distortion with the constraint

$$\max\left\{ \|A_j^i\|_2, \ \|(A_j^i)^{-1}\|_2 \right\} \leq 1 + \epsilon + s_j, \qquad (6)$$

where $\|\cdot\|_2$ is the operator (spectral) norm. The small constant
$\epsilon \geq 0$ is common for all tets and governs the degree of global
non-rigidity. $s_j \geq 0$ is the local stiffness for the j’th tet controlling how
much this particular tet may deform. Note that $\epsilon$ and $s_j$ are not
image specific (i.e. they are independent of i) and encode the
class-prior of how an object class can deform and articulate.

Intuitively, $\|A_j^i\|_2$ and $\|(A_j^i)^{-1}\|_2$ quantify the largest change of
Euclidean length induced by applying $A_j^i$ to any vector. Therefore,
Equation (6) bounds local length changes by a factor of $1 + \epsilon + s_j$.
If, for example, $\epsilon = s_j = 0$ then $A_j^i$ must be a rotation; looser bounds
allow “less locally isometric” deformations. In practice, $\epsilon$ is set to
a small value and is fixed throughout the experiments.
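
The two constraints can be checked numerically for a given linear map through its singular values, since the spectral norm of a matrix is its largest singular value and the spectral norm of its inverse is the reciprocal of the smallest one. The following sketch is a hypothetical helper for such a check (the name `within_stiffness_bound` and the tolerance are assumptions); it is a verification aid, not part of the optimization.

```python
import numpy as np

def within_stiffness_bound(A, eps, s_j, tol=1e-12):
    """Check orientation preservation and the distortion bound for one tet.

    Returns True when det(A) > 0 and max(sigma_max, 1/sigma_min) <= 1 + eps + s_j.
    A singular A (det = 0) can never satisfy the distortion bound, so it is rejected.
    """
    A = np.asarray(A, dtype=float)
    if np.linalg.det(A) <= 0.0:
        return False
    sigma = np.linalg.svd(A, compute_uv=False)   # sorted, largest first
    distortion = max(sigma[0], 1.0 / sigma[-1])
    return bool(distortion <= 1.0 + eps + s_j + tol)
```

For example, a rotation passes with `s_j = 0`, whereas a 1.5x stretch along one axis only passes once `s_j` is raised to about 0.5.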


4.2. Optimizing articulation and deformation 

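Subject to these constraints, we propose minimizing an energy
comprising three terms:

$$f = f_{\mathrm{DEFORM}} + \lambda f_{\mathrm{POS}} + \eta f_{\mathrm{STIFFNESS}}. \qquad (7)$$

$f_{\mathrm{DEFORM}}$ is defined via the ARAP deformation energy (4) as

$$f_{\mathrm{DEFORM}} = \frac{1}{N} \sum_{i=1}^{N} f_{\mathrm{ARAP}}(U^i). \qquad (8)$$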

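$f_{\mathrm{POS}}$ is defined by

$$f_{\mathrm{POS}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k} \left\| \begin{pmatrix} p_k^i \\ 1 \end{pmatrix} - \Pi^i \begin{pmatrix} U^i \alpha_k^i \\ 1 \end{pmatrix} \right\|_2^2, \qquad (9)$$

which accounts for the user prescribed correspondences and the
camera parameters, aiming to satisfy (1). Lastly, we set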

$$f_{\mathrm{STIFFNESS}} = \|s\|_1, \qquad (10)$$

where s is the vector whose elements are the local stiffness bounds
$\{s_j\}$. This L1 regularization encourages most $s_j$ to be 0, so that only
those tets that must distort are allowed to do so.

$\lambda$ is a parameter that controls the trade-off between satisfying the
positional constraints and preserving the original shape of M. $\eta$ is a
parameter that controls the strength of the stiffness regularization. As $\eta$
increases, it forces most $A_j$ to stay rigid, and as $\eta$ approaches 0 the
solution approaches that of the ARAP functional and the positional
constraints. See Section 5 for parameter settings.

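In conclusion, jointly deforming the template S to match each
of the images $I_1, \ldots, I_N$, while estimating the local stiffness, boils
down to the following optimization problem:

$$\begin{aligned}
\min_{\{U^i\}, \{\Pi^i\}, s} \quad & f_{\mathrm{DEFORM}} + \lambda f_{\mathrm{POS}} + \eta f_{\mathrm{STIFFNESS}} \qquad (11) \\
\text{s.t.} \quad & A_j^i = A_j(U^i), \quad \forall j = 1, \ldots, m, \ i = 1, \ldots, N, \\
& \det(A_j^i) \geq 0, \\
& \max\left\{ \|A_j^i\|_2, \ \|(A_j^i)^{-1}\|_2 \right\} \leq 1 + \epsilon + s_j, \\
& s_j \geq 0.
\end{aligned}$$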

Note that usually in prior work, deformations are solved
independently for each set of positional constraints, since there is
nothing that ties multiple problems together. Introducing a shared
stiffness field allows us to leverage information from multiple images
and improve the quality of results for all images.

4.3. Realizing the optimization 

Optimizing (11) is not straightforward, as it involves the
non-convex constraints (5) and (6). We realize these constraints in a convex
optimization framework based on the construction presented in
[KABL14] for optimization subject to bounds on the extremal
singular values of matrices.

This previous work makes the observation that the set of matrices
whose maximal singular value, $\sigma_{\max}$, is bounded from above by
some constant $\Gamma \geq 0$ is convex and can be written as a linear matrix
inequality (LMI):

$$\mathcal{C}^{\Gamma} = \left\{ A \in \mathbb{R}^{n \times n} : \begin{pmatrix} \Gamma I & A \\ A^T & \Gamma I \end{pmatrix} \succeq 0 \right\}. \qquad (12)$$

It is further shown that for any rotation matrix $R \in SO(n)$, the set

$$R\,\mathcal{C}_{\gamma} = \left\{ RA \in \mathbb{R}^{n \times n} : \frac{A + A^T}{2} \succeq \gamma I \right\}, \qquad (13)$$

is a maximal convex subset of the non-convex set of matrices with
non-negative determinant whose minimal singular value, $\sigma_{\min}$, is
bounded from below by some constant $\gamma \geq 0$. This calls for an
iterative algorithm in which R is updated in each iteration so as to
explore the entire set of matrices with bounded minimum
singular value. As suggested by [KABL14], a natural choice for R is the
closest rotation to A. This choice, in turn, also minimizes the ARAP
functional in Equation (4) for a fixed A.
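
The two convex sets can be probed numerically by testing positive semidefiniteness through eigenvalues. The sketch below is a hedged illustration of membership tests for Equations (12) and (13); the function names and the tolerance are assumptions, and an actual solver would impose these as LMI constraints rather than check them after the fact.

```python
import numpy as np

def in_upper_set(A, Gamma, tol=1e-9):
    """Test the LMI of Eq. (12): [[Gamma I, A], [A^T, Gamma I]] is PSD."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    block = np.block([[Gamma * I, A], [A.T, Gamma * I]])
    # The eigenvalues of the block matrix are Gamma +/- sigma_i(A).
    return bool(np.linalg.eigvalsh(block).min() >= -tol)

def in_lower_set(A, R, gamma, tol=1e-9):
    """Test Eq. (13): the symmetric part of R^T A dominates gamma * I."""
    B = np.asarray(R, dtype=float).T @ np.asarray(A, dtype=float)
    sym = 0.5 * (B + B.T)
    return bool(np.linalg.eigvalsh(sym).min() >= gamma - tol)
```

With R chosen as the closest rotation to A, the matrix R^T A is symmetric positive semidefinite, so the second test reduces to comparing the smallest singular value of A against gamma.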

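In order to employ the convex optimization framework of
[KABL14], we rewrite the constraints (5) and (6) as

$$\frac{1}{c_j} \leq \sigma_{\min}(A_j^i) \leq \sigma_{\max}(A_j^i) \leq c_j \quad \text{and} \quad \det(A_j^i) \geq 0,$$

with $c_j = 1 + \epsilon + s_j$. This follows by observing that
$\|A_j^i\|_2 = \sigma_{\max}(A_j^i)$ and $\|(A_j^i)^{-1}\|_2 = 1 / \sigma_{\min}(A_j^i)$. Plugging (11) into the
framework of [KABL14] then yields the following optimization
problem:

$$\begin{aligned}
\min \quad & f_{\mathrm{DEFORM}} + \lambda f_{\mathrm{POS}} + \eta f_{\mathrm{STIFFNESS}} \qquad (14) \\
\text{s.t.} \quad & A_j^i = A_j(U^i), \quad \forall j = 1, \ldots, m, \ i = 1, \ldots, N, \\
& A_j^i \in \mathcal{C}^{\Gamma_j}, \\
& A_j^i \in R_j^i \, \mathcal{C}_{\gamma_j}, \\
& s_j \geq 0, \\
& \Gamma_j \leq (1 + \epsilon + s_j), \\
& \frac{1}{(1 + \epsilon + s_j)} \leq \gamma_j,
\end{aligned}$$

whose optimization variables are $\{U^i\}$, $\{\Gamma_j\}$, $\{\gamma_j\}$ and s.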

Lastly, we note that the last constraint of (14) is convex; in fact,
following a standard derivation (e.g., see [BV04]), it can be
equivalently rewritten as the convex second-order cone constraint

$$\sqrt{4 + \left(\gamma_j - (1 + \epsilon + s_j)\right)^2} \leq \gamma_j + (1 + \epsilon + s_j).$$

Therefore, with fixed $\{R_j^i\}$ and $\{\Pi^i\}$, Equation (14) is a
semidefinite program (SDP) and can be readily solved using any SDP
solver. However, note that the entire problem is not convex due
to the interaction between $R_j^i$, $U^i$ and $\Pi^i$. Thus, we take a
block-coordinate descent approach where we alternate between two steps:
(a) update $R_j^i$ and $\Pi^i$ fixing $U^i$, (b) update $U^i$ fixing $R_j^i$ and $\Pi^i$ via
solving Equation (14). As in [KABL14], we find that allowing the
surface to deform gradually makes the algorithm less susceptible to
local minima. To this end, we initialize the procedure with a large
$\eta$, which controls the degree of non-rigidity, and slowly reduce its
value as the algorithm converges. This algorithm is outlined in
Algorithm 1.
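
The equivalence between the reciprocal bound and its second-order cone form follows from squaring both sides: for positive c and gamma, the cone inequality holds exactly when 4 * gamma * c >= 4. The short sketch below checks this numerically; the function names are hypothetical and `c` stands for the bound 1 + epsilon + s_j.

```python
import math

def reciprocal_bound(gamma, c):
    """Direct form of the last constraint of (14): 1/c <= gamma, with c > 0."""
    return 1.0 / c <= gamma

def cone_form(gamma, c):
    """Second-order cone form: sqrt(4 + (gamma - c)^2) <= gamma + c."""
    return math.hypot(2.0, gamma - c) <= gamma + c
```

Evaluating both predicates on a grid of positive values away from the boundary gamma * c = 1 gives identical answers, which is what makes the constraint usable in a conic solver.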


Algorithm 1: Jointly solving for the deformations and the stiffness

Input: Template 3D mesh S, its auxiliary tetrahedral mesh
       $M = (V, T)$, and N 3D-to-2D annotated images $\{I_i\}$
Output: N deformed auxiliary tetrahedral mesh vertices $\{U^i\}$,
        the projection matrices $\{\Pi^i\}$, and the stiffness model s

maxIter = 10;
$U^{i,0} = V$, $i = 1 \ldots N$;                 // initialize
for $\eta \leftarrow \eta_{\max}$ to $\eta_{\min}$ do          // warm start
    t = 0;
    repeat
        Compute $\Pi^{i,t}$ by solving Equation (1) with $U^{i,t}$;
        Compute the polar decompositions of $A_j^{i,t}$ to obtain $R_j^{i,t}$;
        Update $U^{i,t+1}$ and s by solving Equation (14)
            with $\Pi^{i,t}$ and $R_j^{i,t}$;
        t = t + 1;
    until convergence or t > maxIter
    $U^{i,0} = U^{i,t}$;
return $\{U^i\}$, $\{\Pi^i\}$, s
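
The control flow of Algorithm 1 can be sketched as a small driver that alternates the two steps for each value of the regularization weight. The three callables below are hypothetical placeholders that a caller would have to supply (camera fitting, polar decomposition, and an SDP solve); this is a skeleton under those assumptions, not the authors' implementation.

```python
def alternate(U0, etas, update_cameras, update_rotations, solve_convex_step,
              max_iter=10, tol=1e-6):
    """Block-coordinate descent with a warm start over decreasing eta values.

    Hypothetical callables supplied by the caller:
      update_cameras(U)   -> cameras for the current vertices
      update_rotations(U) -> closest rotations for the current vertices
      solve_convex_step(U, cameras, rotations, eta) -> (U_new, s, change)
    """
    U, s, cameras = U0, None, None
    for eta in etas:                          # large eta first, then smaller
        for _ in range(max_iter):
            cameras = update_cameras(U)       # step (a), vertices fixed
            rotations = update_rotations(U)   # step (a), vertices fixed
            # Step (b): convex solve with cameras and rotations fixed.
            U, s, change = solve_convex_step(U, cameras, rotations, eta)
            if change < tol:                  # converged for this eta
                break
    # The returned cameras are those used in the last convex solve.
    return U, cameras, s
```

Passing the solver as a callable keeps the loop independent of any particular SDP package, and the result of one eta value automatically initializes the next.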


5. Experimental Detail 

We use our approach as described to modify a template 3D mesh 
according to the user-clicked object pose in 2D images. We first 
compare our approach with the recent method of Cashman et al. 
[CF13], which is the closest work to ours with publicly available 
source code [CF]. We then present an ablation study where key 
components of our model are removed in order to evaluate their 
importance and provide qualitative and quantitative evaluations. 

We experiment with two object categories, cats and horses. We
collected 10 cat and 11 horse images in a wide variety of poses
from the Internet. Both of the template 3D meshes were obtained
from the Non-rigid World dataset [BBK07]. These templates
consist of ~3000 vertices and ~6000 faces, which are simplified and
converted into tetrahedral meshes of 510 and 590 vertices and 1500 and
1700 tets for the cat and the horse, respectively, via a tet generation
software [Si15]. We manually simplify the mesh in MeshLab until
there are around 300 vertices. We found automatic simplification
methods over-simplify skinny regions and fine details, leading to
a poor volumetric tet-representation. The cat template and its
auxiliary tetrahedral mesh are shown in Figure 2. The template mesh
used for horses can be seen in Figure 7. For all experiments we set
$\epsilon = 0.01$ and $\lambda = 10$. In order to allow gradually increasing levels
of deformation, we use $\eta_{\max} = 0.5$ and $\eta_{\min} = 0.05$ with 10
log-steps in between for all experiments. The values for $\eta$ and $\lambda$ were
set by hand, but deciding on the values did not require much tuning.
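
One way to generate such a log-spaced schedule is shown below. The helper name `log_schedule` is hypothetical, and the sketch assumes that the 10 steps include both endpoints, which the text does not state explicitly.

```python
import math

def log_schedule(start, stop, num):
    """Return num values evenly spaced on a log scale from start to stop."""
    if num == 1:
        return [start]
    ratio = (stop / start) ** (1.0 / (num - 1))   # constant multiplicative step
    return [start * ratio ** k for k in range(num)]

# Assumed reading of the setup: 10 values from eta_max = 0.5 down to eta_min = 0.05.
etas = log_schedule(0.5, 0.05, 10)
```

Each value is a fixed fraction of the previous one, so the regularization weight shrinks quickly at first in absolute terms and more gently near the end.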

In each iteration, the camera parameters are computed using the
2D-to-3D correspondences. We initialize the parameters using the
direct linear transform algorithm and refine them with the sparse
bundle adjustment package [HZ04, LA09]. In order to obtain
annotations, we set up a simple system where the user can click on 2D
images and click on the corresponding 3D points in the template
mesh. Our system does not require the same vertices to be
annotated in every image. The average number of points annotated for
each image for both cats and horses was 29 points.
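
As a simplified stand-in for the camera estimation step, the sketch below fits a general 2x4 affine camera to 3D-to-2D correspondences by linear least squares. This is a swapped-in technique for illustration only: it is neither the direct linear transform nor the bundle adjustment refinement used here, it does not enforce that the rows come from a rotation, and the function name `fit_affine_camera` is hypothetical.

```python
import numpy as np

def fit_affine_camera(X, p):
    """Least-squares 2x4 affine camera M such that p_k ~ M [X_k; 1].

    X: k x 3 array of 3D points, p: k x 2 array of image points (k >= 4).
    """
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])    # k x 4 homogeneous points
    M_T, *_ = np.linalg.lstsq(Xh, p, rcond=None)     # solves Xh @ M^T ~ p
    return M_T.T
```

With noise-free correspondences from non-coplanar points the fit recovers the generating camera exactly; with noisy clicks it returns the camera that minimizes the summed squared reprojection error.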

6. Results 

Comparison with [CF13]. Cashman et al. employ a low resolution
control mesh on the order of less than 100 vertices which is then
interpolated with Loop subdivision. In order to apply their method
to our data, we simplified our template mesh with quadric
decimation until we reached around 150 vertices while retaining the key
features of the template mesh as much as possible (shown in inset).
Since their method relies on silhouettes, we provide hand-segmented
silhouettes to their algorithm along with the user-clicked points.
We transferred the user-clicks from the full mesh to the simplified
mesh by finding the closest 3D vertex in the simplified mesh for each
labeled vertex in the full resolution mesh. We did not include points
that did not have a close enough 3D vertex due to simplification.
On average 24 points were labeled for their experiment and we use
their default parameters.
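
The click transfer described above amounts to a nearest-vertex lookup with a rejection threshold. A minimal sketch follows; the function name and the `max_dist` parameter are hypothetical, since the text does not give the distance used to decide what counts as close enough.

```python
import math

def transfer_clicks(labeled_points, simplified_vertices, max_dist):
    """Map each labeled 3D point to the index of its nearest simplified vertex.

    Points whose nearest vertex is farther than max_dist are dropped.
    Returns a dict {labeled point index: simplified vertex index}.
    """
    mapping = {}
    for i, x in enumerate(labeled_points):
        # Brute-force nearest neighbor; fine for meshes with a few hundred vertices.
        j, d = min(((j, math.dist(x, v)) for j, v in enumerate(simplified_vertices)),
                   key=lambda pair: pair[1])
        if d <= max_dist:
            mapping[i] = j
    return mapping
```

For larger meshes a spatial index would be the natural replacement for the brute-force search, but the mapping logic stays the same.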



Figure 4: Comparison with [CF13]: the first column shows the
user-clicked input images, the second column shows the result of
[CF13] and the third column shows the result of our proposed
method. Two views are shown for each image, one from the final
estimated camera and another from an orthogonal view point. Our
method is more robust to large deformations and retains the
volume of the model. Note that silhouettes, along with the user-clicked
points, are used to obtain the results for [CF13].

Figure 4 compares the results obtained with the method of
[CF13] and our model. Two views are shown for each result, one
from the estimated camera pose and another from an orthogonal
viewpoint. As the authors in [CF13] point out, their method
focuses on modeling shape and is not designed for highly articulated
objects such as cats. Consequently, we can see that it has difficulties
dealing with the wide range of poses present in our cat dataset.
Regions such as limbs and tails especially lose their original shape.


Their method is based on surface deformation, which does not
have a notion of volume. This causes flattening of the 3D
models as can be seen in the orthogonal views. Since we guarantee
worst-case distortion and orientation preserving deformation of the
auxiliary mesh, our surface reconstructions are well behaved
compared to [CF13]. Recall that silhouettes, along with the user-clicked
points, are used to obtain the results for [CF13].

Qualitative evaluation. The 3D models in Figure 1 were obtained
using our proposed framework. We now evaluate the effectiveness
of the proposed framework by comparing the results without any
distortion bounds (i.e. removing Equation (6)) and with constant
distortion bounds (i.e. fixing $s_j$ to a constant). Qualitative results of
this ablation study are shown in Figure 5. The first column shows
input images along with their user-clicked points. The second
column shows results obtained with no bounds, leaving just the ARAP
energy, which we refer to as Unbounded. This is similar to the
approach used in [KSES14], but with volumetric instead of surface
deformation. The third column, Uniform, shows results obtained
with a uniform bound, where the stiffness bound $1 + \epsilon + s_j$ is replaced
with a single constant $c_j = 2$ for all faces. This is the deformation
energy used in [KABL14] applied to 2D positional constraints. The
constant was slowly increased from 1 to 2 in a manner similar to $\eta$
in order to allow for increasing levels of deformation. Finally, in the
last column we show results obtained with the proposed framework
where the distortions are bounded with local stiffness.

First, notice the wide range of poses present in the images used;
some are particularly challenging, requiring large deformation from
the template 3D mesh. In general, Unbounded concentrates high
distortions near positional constraints, causing unnatural stretching
and deformation around limbs and faces. This is evident with horse
legs in row 4 as Unbounded deforms them in an elastic
manner. Uniform distributes the distortions; however, when the pose
change from the template is significant, distortions spread out too
much, causing unrealistic results as seen in rows 2 and 3. The
unnatural distortion of the faces is still a problem with Uniform.
The proposed framework alleviates problems around the face and
the horse limbs as it learns that those regions are more rigid. Please
refer to the supplementary materials for comprehensive results of
all cat and horse experiments.

We provide visualizations of the learned stiffness model in
Figures 6 and 7. Figure 6 visualizes the learned stiffness values for cats
and horses in various poses. We show the centroid of tetrahedra
faces colored by their stiffness values in log scale. Blue indicates
rigid regions while red indicates highly deformable regions. Recall
that there is one stiffness model for each animal category. The level
of deformation present in the input images is well reflected in the
learned stiffness model. For cats, the torso is learned to be highly
deformable, allowing the animal to twist and curl, while keeping
the skull and limbs more rigid. Similarly for horses, the neck, the
regions connecting the limbs as well as the joints are learned to be
deformable while keeping the skull, limbs, and buttocks region rigid.
The fact that the buttocks region is considered rigid is anatomically
consistent, since horses have a rigid spine making them suitable for
riding [JD80].

We also present segmentation results using the learned stiffness
values as another form of visualization in Figure 7. In order to
obtain the segmentations, we first transferred the stiffness values from
tetrahedra faces to vertices by taking the mean stiffness of faces a
vertex is connected to. Then, we constructed a weighted graph on
the vertices based on their connectivity, where the weights are set
to be the sum of the Euclidean proximity and the similarity of the
stiffness values. We apply normalized cuts to partition this graph
and interpolate the result to the surface mesh vertices using the
parameterization described in Section 3.1. We also show the
segmentation results using just the Euclidean proximity as a comparison.
Stiffness-based segmentation illustrates that regions which deform
together as learned by our framework correspond to semantically
reasonable parts.

Figure 5: Comparison of proposed approach results with its key components removed. User-clicked input images (first column).
Unbounded (second column) is the model without any bounds on the distortion leaving just the volumetric ARAP energy. Uniform (third
column) is when the stiffness bounds ($s_j$ in Equation (6)) are replaced by a single constant, which is the approach of [KABL14] applied
to 2D positional constraints. The last column shows the results with our complete framework where the stiffness bounds and the
deformations are jointly optimized. Without the animal-specific stiffness, distortions either spread out too much or concentrate around the positional
constraints.
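
The first two steps of this segmentation pipeline can be sketched as follows. The averaging of stiffness onto vertices follows the description directly, while the Gaussian kernels and the bandwidths `sigma_d` and `sigma_s` used for the edge weights are assumptions, since the exact proximity and similarity functions are not specified; the normalized cuts step is left to an external routine.

```python
import math
from collections import defaultdict

def vertex_stiffness(tets, s):
    """Mean stiffness of the tets incident to each vertex."""
    acc = defaultdict(list)
    for tet, s_j in zip(tets, s):
        for v in tet:
            acc[v].append(s_j)
    return {v: sum(vals) / len(vals) for v, vals in acc.items()}

def edge_weights(tets, positions, s_vertex, sigma_d=1.0, sigma_s=1.0):
    """Weight per mesh edge: proximity term plus stiffness-similarity term.

    Both terms use Gaussian kernels, which is an assumed choice.
    """
    weights = {}
    for tet in tets:
        for a in tet:
            for b in tet:
                if a < b:   # visit each undirected edge of the tet once
                    d = math.dist(positions[a], positions[b])
                    ds = s_vertex[a] - s_vertex[b]
                    weights[(a, b)] = (math.exp(-(d / sigma_d) ** 2)
                                       + math.exp(-(ds / sigma_s) ** 2))
    return weights
```

The resulting dictionary of edge weights can be assembled into a sparse affinity matrix and passed to any normalized cuts or spectral clustering implementation.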

Figure 6: Visualization of the learned stiffness values. Blue
indicates rigid regions while red indicates highly deformable
regions.

Figure 7: Another visualization of the learned stiffness by means of
segmentation (panels: stiffness-based segmentation and vertex
distance segmentation). Segmenting the template mesh using stiffness
illustrates regions that deform together as learned by our framework.
We see that they correspond to semantically reasonable segmentations.
We show segmentation results based on vertex distance alone as a
comparison.

Figure 8: Deformation results using the learned stiffness from 10
cats as a fixed prior for new images.

The learned stiffness model can be used as a prior to solve for
stiffness-aware deformations of new annotated images. Figure 8
shows the results of deforming the template to new input images
using the stiffness values learned from the previous experiment,
i.e., the new images were not used to learn the stiffness. As in
the other experiments, we warm start the optimization by linearly
increasing the stiffness bounds from 1.01 to their actual value in
10 steps. The results are visually very similar to the results
obtained when the stiffness was learned with those images along
with the other 10 cat images. From this perspective, the joint
optimization for the stiffness and the deformations using multiple
images is the “training” (Figure 5), while the single-image
optimization with a fixed stiffness prior is the “testing”
(Figure 8).
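
The warm start used with a fixed stiffness prior can be pictured as a
simple schedule over the bounds. The sketch below is only an
illustration: it assumes the schedule includes both endpoints (1.01 at
the first step, the learned bounds at the tenth), and solve_step is a
hypothetical callback standing in for one run of the deformation
solver, which is not reproduced here.

```python
import numpy as np

def warm_start_schedule(stiffness_bounds, start=1.01, steps=10):
    """Per-step bounds, linearly interpolated from `start` to the final bounds.

    Assumes the first step uses `start` everywhere and the last step
    uses the learned bounds exactly.
    """
    final = np.asarray(stiffness_bounds, dtype=float)
    alphas = np.linspace(0.0, 1.0, steps)
    return [start + a * (final - start) for a in alphas]

def deform_with_warm_start(solve_step, init_mesh, stiffness_bounds):
    """Run the solver once per schedule step, reusing the previous result."""
    mesh = init_mesh
    for bounds in warm_start_schedule(stiffness_bounds):
        # solve_step is a hypothetical callback: (mesh, bounds) -> mesh.
        mesh = solve_step(mesh, bounds)
    return mesh
```

Each step starts from the previous solution, so the distortion budget
is loosened gradually rather than all at once.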


Table 1: Quantitative evaluation against ground truth. Lower is
better for all metrics.

                         Distortion error metric [YBS04]
Methods     Mean dist    Stretch   Edge    Area    Angle
Unbounded   0.291        1.01      0.156   0.216   0.159
Uniform     0.281        1.01      0.13    0.198   0.13
Proposed    0.287        0.996     0.105   0.181   0.085


Quantitative evaluation Lastly, we conduct an evaluation against
the ground truth by using pictures of a rendered 3D model as the
input to our framework. Specifically, we use the TOSCA dataset
[BBK08], which has 11 vertex-aligned models of cats in various
poses. We take the neutral pose (cat0) as the template and randomly
project the other 10 models to produce images where the ground
truth shape is known. We randomly sample 35 points and use them as
the 3D-to-2D correspondences. In order to guarantee that these
points are well distributed, we segment the model into 15
components and make sure that there is at least one point from each
component. These components correspond to key parts such as the
paws, limbs, left and right ears, tail base and tip, etc. We
compare the results of the Unbounded, Uniform, and proposed
approaches. Using this method, we produce two images from each
ground truth model and conduct the experiment with 20 images.

We evaluate our method using several error metrics. First, we
measure the distortions between the ground truth and the deformed
models, which capture how natural the deformations are. We argue
this is the most important measure since obtaining plausible
deformations is the main goal of our algorithm. For this we use
the stretch, edge-length, area, and angle distortion errors as
defined in [YBS04] by comparing the corresponding triangles.
Additionally, we report the mean Euclidean distance between the 3D
vertices, which measures how close the surfaces of the deformed
models are to the ground truth. While a low Euclidean distance is
desirable for 3D reconstruction, we do not expect a close match
everywhere due to ambiguities arising from a single view and sparse
point constraints. In particular, Euclidean distance is not
necessarily indicative of visual quality. We report the average
error over all 20 input images. Before computing the error metrics,
the deformed and ground truth models are aligned by a similarity
transform. The results are shown in Table 1. As expected, all
methods attain comparable mean Euclidean distance to ground truth,
while our approach obtains substantially lower errors in distortion
metrics. This demonstrates the advantage of learning stiffness from
multiple images, yielding a more plausible deformation model.
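
The similarity alignment applied before computing the error metrics,
followed by the mean Euclidean vertex distance, could be sketched as
below. This is a minimal illustration assuming a standard closed-form
least-squares fit (scale, rotation, translation) over vertex-aligned
meshes; the text does not state which alignment procedure was used,
and the function names are chosen for this example only.

```python
import numpy as np

def similarity_align(source, target):
    """Least-squares similarity transform mapping `source` onto `target`.

    Both inputs are (n, 3) arrays of corresponding vertices. Returns the
    transformed copy of `source`.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src_c, tgt_c = source - mu_s, target - mu_t
    # Cross-covariance between the centered point sets.
    cov = tgt_c.T @ src_c / len(source)
    U, D, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation.
    sign = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        sign[-1] = -1.0
    R = U @ np.diag(sign) @ Vt
    # Optimal scale: trace(D S) divided by the variance of the source.
    scale = (D * sign).sum() / (src_c ** 2).sum(axis=1).mean()
    t = mu_t - scale * R @ mu_s
    return scale * source @ R.T + t

def mean_vertex_distance(deformed, ground_truth):
    """Mean Euclidean distance between corresponding vertices after alignment."""
    aligned = similarity_align(deformed, ground_truth)
    return np.linalg.norm(aligned - ground_truth, axis=1).mean()
```

Because the alignment removes global scale, rotation, and translation,
the remaining distance reflects only differences in shape and pose.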


Implementation details With an unoptimized MATLAB implementation,
training with 10 images took 4 hours and testing a single image
with a learned stiffness prior took ~30 minutes. We use YALMIP
[L04] for the SDP modeling and MOSEK [AA00] as the solver. Our
biggest bottleneck is the SDP optimization due to the many LMI
constraints. Reducing the number of tets can significantly reduce
the run-time.

7. Limitations and Future Work 

Limitations of our current approach suggest directions for future 
work. One failure mode is due to a large pose difference between
the template and the target object, which may lead to an erroneous
camera parameter estimate (e.g., a local minimum), as seen in row 5
of Figure 5. Here, the head of the horse in the image is lowered
for grazing while the head of the horse template is upright,
causing a poor initialization of the camera estimate. Using a
user-supplied estimate of the viewpoint or automatic viewpoint
estimation methods like [TM15] are possible solutions.

Another pitfall is that some parts may be bent in an unnatural 
direction as seen around the left ankle of the horse in row 4 of 
Figure 5. An interesting future direction is to make the distortion 
bounds dependent on the orientation of the transformation. This 
would allow the framework to learn that certain parts only deform 
in certain directions. 

Since we only have a single view for each target object, there
is an inherent depth ambiguity, where many 3D shapes project to
the same 2D observations. As such, some of our deformed models
do not have the “right” 3D pose when seen from a different
view. This is illustrated in our supplementary video, which shows
360-degree views of the final models. One possibility is to combine
our method with recent single-image reconstruction approaches
like [VCdAB14, CKTM15] that use a large image collection of the
same object class to resolve the depth ambiguity.

Our method could also be enhanced to prevent surface intersections
or reason about occlusion (e.g., if a point is labeled, it should
be visible from the camera). Run-time is also an issue for adapting
the stiffness model into a real-time posing application. This may be
addressed by recent advancements in efficiently computing mappings
with geometric bounds [KABL15].

8. Conclusion 

Modifying 3D meshes to fit the pose and shape of objects in images
is an effective way to produce 3D reconstructions of Internet
images. In order to fit object pose naturally, it is essential to
understand how an object can articulate and deform, especially for
highly deformable and articulated objects like cats. In this paper
we propose a method that can learn how an object class can deform
and articulate from a set of user-clicked 2D images and a 3D
template mesh. We do so by introducing a notion of local stiffness
that controls how much each face of the mesh may distort. We
jointly optimize for the deformed meshes and the stiffness values
in an iterative algorithm where a convex optimization problem is
solved in each iteration. Our experiments show that learning
stiffness from multiple images produces more plausible 3D
deformations. We hope this motivates further developments in the
area of automatic point correspondence for non-rigid objects,
enabling fully-automated 3D modeling of animals from 2D images in
the near future.
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